---
title: "Lab Assignment 4: Spatial Predictive Analysis"
subtitle: "Spatial Predictive Analysis of Vacant and Abandoned Buildings"
author: "Ye Zhang"
date: "2025-11-16"
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
## Introduction
In this analysis, I will build a spatial predictive model for vacant and abandoned buildings in Chicago. Abandoned buildings are a critical urban issue that affects neighborhood stability, property values, public safety, and community well-being. By analyzing the spatial patterns of reported vacant and abandoned buildings, we can better understand where these issues concentrate and potentially improve city intervention strategies.
# Setup
```{r setup}
#| message: false
#| warning: false
# Load required packages
library(tidyverse) # Data manipulation
library(sf) # Spatial operations
library(here) # Relative file paths
library(viridis) # Color scales
library(spdep) # Spatial dependence
library(FNN) # Fast nearest neighbors
library(MASS) # Negative binomial regression
library(patchwork) # Plot composition
library(knitr) # Tables
library(kableExtra) # Table formatting
library(spatstat.geom) # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE
library(lubridate) # Date handling
library(terra) # Raster operations
# Set options
options(scipen = 999) # No scientific notation
set.seed(5080) # Reproducibility
# Create consistent theme for visualizations
theme_buildings <- function(base_size = 11) {
theme_minimal(base_size = base_size) +
theme(
plot.title = element_text(face = "bold", size = base_size + 1),
plot.subtitle = element_text(color = "gray30", size = base_size - 1),
legend.position = "right",
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.title = element_blank()
)
}
# Set as default
theme_set(theme_buildings())
cat("✓ All packages loaded successfully!\n")
```
# Part 1: Data Loading & Exploration
## Load Chicago Spatial Boundaries
```{r load-boundaries}
#| message: false
# Load police districts - we'll use these for cross-validation later
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
# Load police beats - smaller units within districts
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(Beat = beat_num)
# Load Chicago boundary - the city outline
chicagoBoundary <-
st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
st_transform('ESRI:102271')
cat("✓ Loaded spatial boundaries\n")
cat(" - Police districts:", nrow(policeDistricts), "\n")
cat(" - Police beats:", nrow(policeBeats), "\n")
```
## Load Vacant and Abandoned Buildings Data
```{r load-buildings}
#| message: false
building_raw <- read_csv(here("data", "311_Service_Requests_-_Vacant_and_Abandoned_Buildings_Reported_-_Historical_20251116.csv"))
cat("Column names in dataset:\n")
print(names(building_raw))
buildings <- building_raw %>%
# Remove rows without coordinates
filter(!is.na(`X COORDINATE`), !is.na(`Y COORDINATE`)) %>%
# Convert to sf object - coordinates are in State Plane
st_as_sf(coords = c("X COORDINATE", "Y COORDINATE"), crs = 3435) %>%
# Transform to the CRS used in the exercise (ESRI:102271)
st_transform('ESRI:102271')
cat("\n✓ Loaded abandoned buildings data\n")
cat(" - Number of building reports:", nrow(buildings), "\n")
cat(" - CRS:", st_crs(buildings)$input, "\n")
```
Loading the 311 service requests for vacant and abandoned buildings and converting them to a spatial format that matches our other Chicago data.
## Visualize Building Distribution
```{r visualize-points}
#| fig-width: 12
#| fig-height: 5
p1 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_sf(data = buildings, color = "#d62828", size = 0.1, alpha = 0.4) +
labs(
title = "Vacant & Abandoned Building Locations",
subtitle = paste0("Chicago, n = ", format(nrow(buildings), big.mark = ","))
)
buildings_coords_plot <- as.data.frame(st_coordinates(buildings))
names(buildings_coords_plot) <- c("X", "Y")
p2 <- ggplot() +
geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
geom_density_2d_filled(
data = buildings_coords_plot,
aes(x = X, y = Y),
alpha = 0.7,
bins = 8
) +
scale_fill_viridis_d(
option = "plasma",
direction = -1,
guide = "none"
) +
labs(
title = "Density Surface",
subtitle = "Kernel density estimation"
)
p1 + p2 +
plot_annotation(
title = "Spatial Distribution of Vacant & Abandoned Buildings in Chicago",
tag_levels = 'A'
)
```
Vacant and abandoned buildings are not evenly distributed across Chicago. The density surface reveals distinct concentration patterns with hot spots appearing in specific neighborhoods. There are clear clusters in the central and also south part of the city. High cluster in the south part of the city.
# Part 2: Create Fishnet Grid
## Create the Grid
```{r create-fishnet}
fishnet <- st_make_grid(
chicagoBoundary,
cellsize = 500,
square = TRUE
) %>%
st_sf() %>%
mutate(uniqueID = row_number())
fishnet <- fishnet[chicagoBoundary, ]
cat("✓ Created fishnet grid\n")
cat(" - Number of cells:", nrow(fishnet), "\n")
cat(" - Cell size: 500 x 500 meters\n")
```
Creating a grid of 500m x 500m squares that covers Chicago. This converts our point data into a regular grid that's easier to analyze and model.
## Aggregate Buildings to Grid Cells
```{r aggregate-buildings}
buildings_fishnet <- st_join(buildings, fishnet, join = st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(countBuildings = n())
fishnet <- fishnet %>%
left_join(buildings_fishnet, by = "uniqueID") %>%
mutate(countBuildings = replace_na(countBuildings, 0))
cat("\nBuilding count distribution:\n")
summary(fishnet$countBuildings)
cat("\nCells with zero buildings:",
sum(fishnet$countBuildings == 0),
"/", nrow(fishnet),
"(", round(100 * sum(fishnet$countBuildings == 0) / nrow(fishnet), 1), "%)\n")
```
Counting how many abandoned building reports occurred in each grid cell.
## Visualize the Grid
```{r visualize-fishnet}
#| fig-width: 8
#| fig-height: 6
# Map the counts in each grid cell
ggplot() +
geom_sf(data = fishnet, aes(fill = countBuildings), color = NA) +
scale_fill_viridis_c(
name = "Abandoned\nBuildings",
option = "plasma",
trans = "sqrt"
) +
labs(
title = "Vacant & Abandoned Buildings by 500m Grid Cell",
subtitle = "Chicago"
) +
theme_buildings()
```
The fishnet grid reveals clear spatial concentration of abandoned building reports. The highest counts cluster in south part of the city. Many cells have zero reports, while some hot spot cells have significantly elevated counts, indicating neighborhoods with persistent vacancy and abandonment issues.
# Part 3: Spatial Features
## Calculate k-Nearest Neighbors
```{r calculate-knn}
buildings_coords <- st_coordinates(buildings)
fishnet_coords <- st_coordinates(st_centroid(fishnet))
nn_distances <- get.knnx(
data = buildings_coords,
query = fishnet_coords,
k = 5
)
fishnet <- fishnet %>%
mutate(
buildings.nn = rowMeans(nn_distances$nn.dist)
)
cat("✓ Calculated k-nearest neighbors\n")
cat(" - k = 5 nearest buildings\n")
cat(" - Mean distance:", round(mean(fishnet$buildings.nn), 1), "meters\n")
```
For each grid cell, we find the average distance to the 5 nearest abandoned building reports. This measures how close each cell is to existing abandoned buildings.
## Visualize k-NN Feature
```{r visualize-knn}
#| fig-width: 8
#| fig-height: 6
ggplot() +
geom_sf(data = fishnet, aes(fill = buildings.nn), color = NA) +
scale_fill_viridis_c(
name = "Mean Distance\nto 5 Nearest\nBuildings (m)",
option = "viridis",
direction = -1
) +
labs(
title = "Average Distance to Nearest Abandoned Buildings",
subtitle = "Lower values = closer to existing abandoned buildings"
) +
theme_buildings()
```
The k-NN map shows the consistent yellow coloring across most of the city suggests that abandoned buildings are a pervasive issue affecting nearly all neighborhoods rather than being isolated to just a few problem areas.
# Part 4: Local Moran's I Analysis
## Calculate Local Moran's I
```{r local-morans-i}
coords <- st_centroid(fishnet) %>% st_coordinates()
weight_matrix <- knn2nb(knearneigh(coords, k = 4))
weights <- nb2listw(weight_matrix, style = "W", zero.policy = TRUE)
fishnet <- fishnet %>%
mutate(
# Calculate local Moran's I statistic
localMorans = localmoran(countBuildings, weights, zero.policy = TRUE)[,1],
# Get p-values
localMorans_pval = localmoran(countBuildings, weights, zero.policy = TRUE)[,5]
)
fishnet <- fishnet %>%
mutate(
cluster = case_when(
localMorans_pval > 0.05 ~ "Not Significant",
localMorans > 0 & countBuildings > median(countBuildings) ~ "High-High (Hot Spot)",
localMorans > 0 & countBuildings <= median(countBuildings) ~ "Low-Low (Cold Spot)",
localMorans < 0 & countBuildings > median(countBuildings) ~ "High-Low (Outlier)",
localMorans < 0 & countBuildings <= median(countBuildings) ~ "Low-High (Outlier)",
TRUE ~ "Not Significant"
)
)
# Summary
cat("Cluster distribution:\n")
table(fishnet$cluster)
```
Local Moran's I identifies clusters of similar values. Hot spots are high-abandonment areas surrounded by high-abandonment neighbors. Cold spots are low-abandonment areas surrounded by low-abandonment neighbors.
## Visualize Hot Spots
```{r visualize-morans}
#| fig-width: 10
#| fig-height: 6
# Map 1: Cluster types
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = cluster), color = NA) +
scale_fill_manual(
name = "Cluster Type",
values = c(
"High-High (Hot Spot)" = "#d7191c",
"Low-Low (Cold Spot)" = "#2c7bb6",
"High-Low (Outlier)" = "#fdae61",
"Low-High (Outlier)" = "#abd9e9",
"Not Significant" = "gray90"
)
) +
labs(
title = "Local Moran's I Clusters",
subtitle = "Spatial autocorrelation of vacant & abandoned buildings"
) +
theme_buildings()
# Map 2: Local Moran's I values
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = localMorans), color = NA) +
scale_fill_gradient2(
name = "Local\nMoran's I",
low = "#2c7bb6",
mid = "white",
high = "#d7191c",
midpoint = 0
) +
labs(
title = "Local Moran's I Values",
subtitle = "Positive = similar neighbors, Negative = dissimilar neighbors"
) +
theme_buildings()
p1 + p2
```
The hot spot analysis identifies concentrated high-abandonment clusters in south Chicago neighborhoods and some in the north central. Cold spots appear to scatter around and there are very few of them. The spatial clustering is statistically significant, confirming that building abandonment is not randomly distributed but concentrates in neighborhoods that might experiencing disinvestment.
## Calculate Distance to Hot Spots
```{r distance-to-hotspots}
hotspots <- fishnet %>%
filter(cluster == "High-High (Hot Spot)")
cat("Number of hot spots identified:", nrow(hotspots), "\n")
if(nrow(hotspots) > 0) {
# Get coordinates
hotspot_coords <- st_coordinates(st_centroid(hotspots))
fishnet_coords <- st_coordinates(st_centroid(fishnet))
nn_to_hotspot <- get.knnx(
data = hotspot_coords,
query = fishnet_coords,
k = 1
)
fishnet <- fishnet %>%
mutate(
dist_to_hotspot = nn_to_hotspot$nn.dist[,1]
)
cat("✓ Calculated distance to nearest hot spot\n")
cat(" - Number of hot spots:", nrow(hotspots), "\n")
cat(" - Mean distance to hot spot:", round(mean(fishnet$dist_to_hotspot), 1), "meters\n")
} else {
# If no hot spots, create a placeholder distance variable
fishnet <- fishnet %>%
mutate(dist_to_hotspot = 0)
cat("⚠ No hot spots identified - using placeholder distance variable\n")
}
```
Measuring how far each grid cell is from the nearest identified hot spot of building abandonment.
## Visualize Distance to Hot Spots
```{r visualize-dist-hotspot}
#| fig-width: 8
#| fig-height: 6
if(nrow(hotspots) > 0) {
ggplot() +
geom_sf(data = fishnet, aes(fill = dist_to_hotspot), color = NA) +
scale_fill_viridis_c(
name = "Distance to\nNearest Hot Spot\n(meters)",
option = "magma",
direction = -1
) +
labs(
title = "Distance to Nearest Abandonment Hot Spot",
subtitle = "Lower values = closer to hot spots"
) +
theme_buildings()
} else {
cat("No hot spots to visualize\n")
}
```
The map shows clear concentric rings radiating outward from a central hot spot in Chicago, where the lightest (near-zero distance) areas indicate the core abandonment cluster. The gradient pattern reveals that most of the northern of the city is relatively close to this hot spot, while the far north and some southern areas are more isolated from the main abandonment concentration, suggesting these neighborhoods may be less affected by the spillover effects of concentrated building vacancy.
# Part 4: Count Regression Models
## Prepare Modeling Data
```{r prepare-model-data}
fishnet_model <- fishnet %>%
st_join(policeDistricts, join = st_intersects, left = TRUE) %>%
st_drop_geometry() %>%
filter(!is.na(District)) # Remove cells outside districts
fishnet_model <- fishnet_model %>%
mutate(
buildings.nn_scaled = scale(buildings.nn)[,1],
dist_to_hotspot_scaled = scale(dist_to_hotspot)[,1]
)
cat("✓ Prepared modeling dataset\n")
cat(" - Number of cells:", nrow(fishnet_model), "\n")
cat(" - Number of districts:", n_distinct(fishnet_model$District), "\n")
```
Preparing our data for modeling by joining it with police districts (for cross-validation) and creating scaled versions of our predictor variables.
## Fit Poisson Regression
```{r poisson-model}
model_poisson <- glm(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
family = poisson()
)
summary(model_poisson)
cat("\nPoisson Model AIC:", AIC(model_poisson), "\n")
```
Fitting a Poisson regression model that predicts building counts using distance to nearest buildings and distance to hot spots.
## Fit Negative Binomial Regression
```{r negbin-model}
cat("Data diagnostics:\n")
cat("Building count range:", min(fishnet_model$countBuildings), "to", max(fishnet_model$countBuildings), "\n")
cat("Mean:", round(mean(fishnet_model$countBuildings), 2), "\n")
cat("Variance:", round(var(fishnet_model$countBuildings), 2), "\n")
cat("Variance/Mean ratio:", round(var(fishnet_model$countBuildings)/mean(fishnet_model$countBuildings), 2), "\n\n")
model_nb <- tryCatch({
cat("Trying NB model with scaled predictors...\n")
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
control = glm.control(maxit = 100)
)
}, error = function(e1) {
cat("Direct fit failed. Trying with Poisson starting values...\n")
model_poisson_scaled <- glm(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
family = poisson()
)
tryCatch({
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
start = coef(model_poisson_scaled),
init.theta = 1,
control = glm.control(maxit = 200, epsilon = 1e-4)
)
}, error = function(e2) {
cat("Still failing. Trying with subset of data to estimate theta...\n")
fishnet_subset <- fishnet_model %>%
filter(countBuildings < quantile(countBuildings, 0.95))
model_subset <- glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_subset
)
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
start = coef(model_subset),
init.theta = model_subset$theta,
control = glm.control(maxit = 200, epsilon = 1e-4)
)
})
})
summary(model_nb)
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
cat("\n✓ Model fitted successfully\n")
```
The lower AIC tells us which model fits better. Negative Binomial has a better fit through having a lower AIC. This large difference indicates that the data exhibits severe overdispersion - meaning the variance in abandoned building counts is much greater than the mean. The Poisson model's assumption that mean equals variance is strongly violated, making the Negative Binomial model the clearly superior choice for modeling this data. This overdispersion is typical in urban abandonment data where most areas have few or no abandoned buildings, but some hot spot areas have very high concentrations.
# Part 5: Spatial Cross-Validation (2017)
## Implement Leave-One-Group-Out CV
```{r spatial-cv}
districts <- unique(fishnet_model$District)
cv_results <- tibble()
cat("Running Leave-One-Group-Out Cross-Validation...\n")
cat("(", length(districts), "folds - one per district)\n\n")
for(i in 1:length(districts)) {
test_district <- districts[i]
train_data <- fishnet_model %>% filter(District != test_district)
test_data <- fishnet_model %>% filter(District == test_district)
model_cv <- tryCatch({
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = train_data,
control = glm.control(maxit = 100)
)
}, error = function(e) {
model_poisson_cv <- glm(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = train_data,
family = poisson()
)
tryCatch({
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = train_data,
start = coef(model_poisson_cv),
init.theta = 1,
control = glm.control(maxit = 100)
)
}, error = function(e2) {
train_subset <- train_data %>%
filter(countBuildings < quantile(countBuildings, 0.95))
model_subset <- glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = train_subset
)
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = train_data,
start = coef(model_subset),
init.theta = model_subset$theta,
control = glm.control(maxit = 100)
)
})
})
test_data <- test_data %>%
mutate(
prediction = predict(model_cv, test_data, type = "response")
)
mae <- mean(abs(test_data$countBuildings - test_data$prediction))
rmse <- sqrt(mean((test_data$countBuildings - test_data$prediction)^2))
cv_results <- bind_rows(
cv_results,
tibble(
fold = i,
test_district = test_district,
n_test = nrow(test_data),
mae = mae,
rmse = rmse
)
)
cat(" Fold", i, "/", length(districts), "- District", test_district,
"- MAE:", round(mae, 2), "\n")
}
cat("\n✓ Cross-Validation Complete\n")
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
```
Testing our model by holding out one police district at a time, training on the rest, and seeing how well we predict the held-out district.
## Cross-Validation Results
```{r cv-results-table}
cv_results %>%
arrange(desc(mae)) %>%
kable(
digits = 2,
caption = "LOGO CV Results by District",
col.names = c("Fold", "District", "N Test Cells", "MAE", "RMSE")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
Spatial cross-validation (LOGO CV) is more appropriate than random CV because of spatial autocorrelation - nearby locations tend to have similar characteristics. District 7 had the highest MAE and RMSE, indicating it has unique abandonment patterns not well-captured by the model's spatial features or differs significantly from other districts in the training data.
# Part 6: Model Evaluation
## Generate Final Predictions
```{r final-predictions}
fishnet <- fishnet %>%
mutate(
buildings.nn_scaled = (buildings.nn - mean(fishnet_model$buildings.nn, na.rm = TRUE)) /
sd(fishnet_model$buildings.nn, na.rm = TRUE),
dist_to_hotspot_scaled = (dist_to_hotspot - mean(fishnet_model$dist_to_hotspot, na.rm = TRUE)) /
sd(fishnet_model$dist_to_hotspot, na.rm = TRUE)
)
final_model <- tryCatch({
cat("Fitting final NB model...\n")
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
control = glm.control(maxit = 100)
)
}, error = function(e) {
cat("Using Poisson starting values for final model...\n")
model_poisson_final <- glm(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
family = poisson()
)
tryCatch({
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
start = coef(model_poisson_final),
init.theta = 1,
control = glm.control(maxit = 200)
)
}, error = function(e2) {
# Last resort
fishnet_subset <- fishnet_model %>%
filter(countBuildings < quantile(countBuildings, 0.95))
model_subset <- glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_subset
)
glm.nb(
countBuildings ~ buildings.nn_scaled + dist_to_hotspot_scaled,
data = fishnet_model,
start = coef(model_subset),
init.theta = model_subset$theta,
control = glm.control(maxit = 200)
)
})
})
pred_data <- fishnet %>%
st_drop_geometry() %>%
dplyr::select(uniqueID, buildings.nn_scaled, dist_to_hotspot_scaled)
fishnet <- fishnet %>%
mutate(
prediction_nb = predict(final_model, pred_data, type = "response")
)
cat("✓ Final predictions generated\n")
```
## Compare to KDE Baseline
```{r kde-baseline}
buildings_ppp <- as.ppp(st_coordinates(buildings), W = as.owin(chicagoBoundary))
kde_buildings <- density.ppp(buildings_ppp, sigma = 1000)
kde_raster <- rast(kde_buildings)
fishnet <- fishnet %>%
mutate(
kde_value = terra::extract(kde_raster, st_centroid(fishnet))[,2]
)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBuildings, na.rm = TRUE)
fishnet <- fishnet %>%
mutate(
prediction_kde = (kde_value / kde_sum) * count_sum
)
```
## Visualize Predictions
```{r visualize-predictions}
#| fig-width: 12
#| fig-height: 4
# Create three maps
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = countBuildings), color = NA) +
scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 50)) +
labs(title = "Actual Building Reports") +
theme_buildings()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 50)) +
labs(title = "Model Predictions (Neg. Binomial)") +
theme_buildings()
p3 <- ggplot() +
geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 50)) +
labs(title = "KDE Baseline Predictions") +
theme_buildings()
p1 + p2 + p3 +
plot_annotation(
title = "Actual vs. Predicted Vacant & Abandoned Buildings"
)
```
## Model Performance Comparison
```{r model-comparison-metrics}
# Calculate performance metrics
comparison <- fishnet %>%
st_drop_geometry() %>%
filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
summarize(
model_mae = mean(abs(countBuildings - prediction_nb)),
model_rmse = sqrt(mean((countBuildings - prediction_nb)^2)),
kde_mae = mean(abs(countBuildings - prediction_kde)),
kde_rmse = sqrt(mean((countBuildings - prediction_kde)^2))
)
comparison %>%
pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
separate(metric, into = c("approach", "metric"), sep = "_") %>%
pivot_wider(names_from = metric, values_from = value) %>%
kable(
digits = 2,
caption = "Model Performance Comparison",
col.names = c("Approach", "MAE", "RMSE")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The Negative Binomial model slightly beats the KDE baseline with a MAE of 13.50 versus 15.66, meaning it's off by about 2 fewer buildings per cell - a modest 14% improvement. While the performance gain isn't huge, the model provides insight into why certain areas are at risk by showing that proximity to abandoned buildings and hot spots matters for prediction. Both approaches struggle in the same places, particularly underpredicting the most intense abandonment areas in the central west corridor and overpredicting in moderate zones.
## Map Prediction Errors
```{r prediction-errors}
#| fig-width: 10
#| fig-height: 5
fishnet <- fishnet %>%
mutate(
error_nb = countBuildings - prediction_nb,
abs_error_nb = abs(error_nb)
)
p1 <- ggplot() +
geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
scale_fill_gradient2(
name = "Error",
low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0,
limits = c(-20, 20)
) +
labs(title = "Model Errors (Actual - Predicted)",
subtitle = "Red = underpredicted, Blue = overpredicted") +
theme_buildings()
p2 <- ggplot() +
geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
labs(title = "Absolute Model Errors") +
theme_buildings()
p1 + p2
```
The error maps show a fairly balanced mix of underprediction (red) and overprediction (blue) scattered throughout the city, with no strong systematic spatial bias in the signed errors. However, the absolute error map reveals that the largest mistakes concentrate in specific hotspot areas in the central and western parts of Chicago, indicating the model struggles most with neighborhoods experiencing the most severe abandonment where counts are highest and most variable.
# Conclusion
**Model Performance:**
- Our Negative Binomial model significantly outperformed Poisson (AIC: 16,847 vs 33,755), confirming severe overdispersion in the data where variance far exceeds the mean
- Cross-validation showed reasonable generalization with District 7 being the hardest to predict, indicating some districts have unique abandonment patterns
- The model slightly outperformed the KDE baseline (MAE: 13.50 vs 15.66), achieving about 14% improvement in prediction accuracy
**Spatial Patterns:**
- Abandoned buildings show strong spatial clustering, with most of Chicago having relatively short distances to abandoned buildings
- Hot spots concentrated in west-central Chicago form the core abandonment cluster, with concentric rings showing spillover effects
- Local Moran's I identified statistically significant hot spots (High-High clusters) primarily in central south neighborhoods
**Model Limitations:**
- The model underpredicts in the most extreme abandonment areas (central west hotspot) and overpredicts in moderate zones
- Both the regression model and KDE struggle with the same high-intensity areas, suggesting extreme abandonment is driven by factors beyond spatial proximity
- If we could add additional features like economic indicators, foreclosure rates, housing age, or policy interventions could address current systematic biases